Recently on this channel there was a video about using an offline coding assistant in vscode. There are a number of them out there, but I have had the most luck with Continue and with Llama Coder. Llama Coder works in the code file, completing what you are typing, whereas Continue gives you a chat interface with your code as context.
One of the common questions about that video is basically: what languages can I use with these assistants? And to answer that, we should look at what they are doing. So you can break the process down into a few steps.
First, you, the developer, start writing some code. Maybe you add a comment and want the assistant to write whatever you asked for. Or maybe you want it to fill in the blank in the middle of some code. The assistant copies the code on the page, and maybe some other pages for context. It then formats that code into the shape a model expects to see.
So what is the format? Well, many of the models use their own format with their own keywords. Figuring out how to build these models takes a long time, and often the keywords and formats of other models hadn’t been published yet when the researchers started building the one you are using. The model developers aren’t trying to make our lives hell… I think… I hope.
So what defines the format? The researchers go through a process called training: they format the inputs in a special way, feed them to the model, and adjust the model’s parameters to try to make sure it answers the right way every time. Then they repeat that with a huge number of inputs and outputs, and hopefully at the end we have a smart model. That’s a super simple way to look at it; I’ll cover training and what it means in another video. But the model expects all future input to look just like the training input, so we have to stick with the special format.
For models that use the DeepSeek format, that looks like fim_begin and then your input, then fim_hole, which marks where you want the answer to go, then more input, and finally fim_end. The angle brackets and pipe characters you see around those keywords are important to the format, but hard to say out loud, so I skipped them. Having that fim_hole keyword is pretty cool, and filling in the middle like that is often referred to as infilling.
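To make that concrete, here is a minimal sketch of building a fill-in-the-middle prompt in Python. The token spelling is my best reading of the DeepSeek Coder docs (those full-width pipes and separators matter), so double-check the model card before relying on it.

```python
# A rough sketch of the DeepSeek-style fill-in-the-middle prompt an assistant might build.
# The exact token characters below are taken from the DeepSeek Coder docs as I read them;
# verify them against the model card, because the model only recognizes the exact strings.

FIM_BEGIN = "<｜fim▁begin｜>"
FIM_HOLE = "<｜fim▁hole｜>"
FIM_END = "<｜fim▁end｜>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Wrap the code before and after the cursor in the model's FIM keywords."""
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}"

print(build_fim_prompt(
    prefix="def add(a, b):\n    ",
    suffix="\n\nprint(add(2, 3))\n",
))
```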
So once it has formatted the prompt, it hands it off to the model. But tools like this often don’t actually run the models. In the case of Llama coder, its uses Ollama to run the model. So you need to have Ollama installed to run it. Ollama has a special endpoint it listens on for requests, gets the specially formatted prompt, and then outputs the answer, one token at a time. A token is roughly a word or common part of a word. Serving this back a token at a time is referred to as streaming.
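Here is a minimal sketch of that request and the streaming response, assuming Ollama is running locally on its default port; the model name is just an example of a code model you might have pulled.

```python
# Minimal sketch: send a prompt to the local Ollama generate endpoint and print the answer
# as it streams back. Each line of the response is one JSON object carrying a small chunk
# of text, plus a "done" flag on the final one. The model name is an example, not a requirement.
import json
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-coder:6.7b-base",   # assumed; use whatever model you have pulled
        "prompt": "# a python function that reverses a string\n",
        "stream": True,
    },
    stream=True,
)

for line in resp.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    print(chunk["response"], end="", flush=True)
    if chunk.get("done"):
        break
```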
The assistant takes that stream and outputs the answer to the code window.
So let’s use Llama Coder to see this in action. In the settings for Llama Coder, we start with an Ollama server endpoint. If you are running Ollama on the same machine as vscode, you can leave this blank. Next is the model to use. You can see that as of this recording at the end of January 2024, there are Stable Code, Code Llama, and DeepSeek Coder. These are all special models that focus on writing code, and they all have formats that allow for filling in the blanks in the middle of a code block. Next is temperature. This is often associated with ‘creativity’, though that’s not really a great term for it. Models work by guessing the first word and then figuring out the most likely next word, over and over. A higher temperature means the most likely next word isn’t always the one chosen, so you may get something very different. If you want to use a different model, you can specify it in custom, along with the format to use. This is great if you have a fine-tune based on DeepSeek, Stable Code, or Code Llama, which will use the same format. The final two options limit how much can be written in one swoop.
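Under the hood, those settings end up as options on the request to Ollama. Here is an illustrative sketch; the numbers and the model name are examples, not Llama Coder’s actual defaults.

```python
# Illustrative sketch of how a temperature setting and an output cap travel to Ollama as
# request options. The values and model name are examples only, not Llama Coder's defaults.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "stable-code:3b",     # example model name; any model you have pulled works
        "prompt": "# a python function that checks whether a number is prime\n",
        "stream": False,
        "options": {
            "temperature": 0.2,        # lower = stick closer to the most likely next token
            "num_predict": 256,        # caps how many tokens come back in one request
        },
    },
)
print(resp.json()["response"])
```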
So now you can start writing code or add a comment. When you press the space key and then nothing else for a second, it will feed what you are working on to the model, and that grey text is what it suggests you might want to use. You can accept the full suggestion with Tab, or just the next word of the suggestion with Command and an arrow key. I assume that’s going to be Control and an arrow key on Windows and Linux.
But when we did that, we didn’t see the special format that the assistant sent to the model. Is that just a black box? Well, it used to be. Let me show you something that will probably come out in 0.1.23 or another future version. If I quit the Ollama server, build Ollama myself, and run that with the OLLAMA_DEBUG=1 environment variable, then I get extra stuff in the logs. Now I will start ollama run in another window and ask my favorite prompt. I am pretty sure I was the first one to use this in this space and now it’s pretty common, though if you find someone doing it first, I will ignore it and keep thinking I am special. Why is the sky blue? And we see the answer. Now go back to the server logs and scroll up, and we can see the prompt in the logs, along with the full answer. This is great because you can see how the template was applied to the prompt.
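In a shell you would normally just run OLLAMA_DEBUG=1 ollama serve; if you prefer to script it, a rough Python equivalent looks like this, assuming the ollama binary is on your PATH.

```python
# Rough sketch: start the Ollama server with OLLAMA_DEBUG=1 so prompts and answers show up
# in the server logs. Normally you would just set the variable in your shell instead.
import os
import subprocess

env = dict(os.environ, OLLAMA_DEBUG="1")
subprocess.run(["ollama", "serve"], env=env)
```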
Now let’s try using vscode.
And if we look at the logs, we can see the fim begin, fim hole, and fim end, just as I described. There are a lot more entries here, because it’s running every time I press the space key and then pause for a moment. But the concept is the same.
So that’s how this stuff works. But I still didn’t really tell you what languages are supported. The best way to figure that out is to go back to the ollama.ai page for the model, then find the link to the Hugging Face repo for the model and read their docs. For DeepSeek Coder, that points us to the GitHub repo for the model, and that shows us a list of supported programming languages. It looks like it’s pretty much every language you can think of, plus a few I hadn’t heard of. Idris? Isabelle? Bluespec? Augeas? So yeah, your language is probably there. Though here is a page full of languages that are not supported.
For Stable Code, the list of languages is actually on the ollama.ai model page: a shorter list of 18 languages. Code Llama seems to have a much shorter list, but I couldn’t be certain what the list actually is.
All of the models reference various benchmarks that all prove that their model is the best, but if you have seen my videos before, you know my opinion on benchmarks: they all suck wind. To figure out the best model for you, you are going to have to try them all. Also try the different sizes of the models, though I tend to choose the smallest one that gives a good answer. Having to wait for a model to generate an answer means I won’t wait and will just keep typing. And I really don’t want to peg my GPU constantly, especially when on battery, because then my all-day M1 battery life gets closer to my Intel MacBook Pro with its 45 minutes of battery, and I don’t want to go back to that.